 Ashland


AquiLLM: a RAG Tool for Capturing Tacit Knowledge in Research Groups

Campbell, Chandler, Boscoe, Bernie, Do, Tuan

arXiv.org Artificial Intelligence

Research groups face persistent challenges in capturing, storing, and retrieving knowledge that is distributed across team members. Although structured data intended for analysis and publication is often well managed, much of a group's collective knowledge remains informal, fragmented, or undocumented--often passed down orally through meetings, mentoring, and day-to-day collaboration. This includes private resources such as emails, meeting notes, training materials, and ad hoc documentation. Together, these reflect the group's tacit knowledge--the informal, experience-based expertise that underlies much of their work. Accessing this knowledge can be difficult, requiring significant time and insider understanding. Retrieval-augmented generation (RAG) systems offer promising solutions by enabling users to query and generate responses grounded in relevant source material. However, most current RAG-LLM systems are oriented toward public documents and overlook the privacy concerns of internal research materials. We introduce AquiLLM (pronounced ah-quill-em), a lightweight, modular RAG system designed to meet the needs of research groups. AquiLLM supports varied document types and configurable privacy settings, enabling more effective access to both formal and informal knowledge within scholarly groups.
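The retrieval step of a RAG system with access control can be illustrated with a minimal sketch. This is not AquiLLM's implementation: the bag-of-words cosine scoring, the `group` field used as a privacy setting, and the sample documents are all assumptions made for illustration. A real system would typically use learned embeddings and pass the retrieved passages to an LLM.

```python
import math
from collections import Counter

PUNCT = ".,;:!?()"

def tokenize(text):
    """Lowercase words with surrounding punctuation stripped."""
    return [w.strip(PUNCT).lower() for w in text.split() if w.strip(PUNCT)]

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, user_groups, k=2):
    """Return ids of the top-k matching documents the user may see.

    Documents whose (hypothetical) `group` label is not in `user_groups`
    are skipped before scoring, so private text never reaches the prompt.
    """
    q = Counter(tokenize(query))
    scored = []
    for doc in documents:
        if doc["group"] not in user_groups:
            continue  # privacy filter (assumed design, not from the paper)
        score = cosine(q, Counter(tokenize(doc["text"])))
        if score > 0:
            scored.append((score, doc["id"]))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical documents for illustration only.
DOCS = [
    {"id": "meeting-notes", "group": "lab",
     "text": "Telescope calibration steps discussed at the weekly meeting"},
    {"id": "onboarding", "group": "lab",
     "text": "How to request cluster accounts for new students"},
    {"id": "pi-email", "group": "faculty",
     "text": "Private email about telescope calibration budget"},
]

lab_only = retrieve("telescope calibration", DOCS, {"lab"})
everyone = retrieve("telescope calibration", DOCS, {"lab", "faculty"})
```

Filtering before scoring, rather than after ranking, keeps restricted documents from displacing permitted ones in the top-k list.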


NexusSum: Hierarchical LLM Agents for Long-Form Narrative Summarization

Kim, Hyuntak, Kim, Byung-Hak

arXiv.org Artificial Intelligence

Summarizing long-form narratives--such as books, movies, and TV scripts--requires capturing intricate plotlines, character interactions, and thematic coherence, a task that remains challenging for existing LLMs. We introduce NexusSum, a multi-agent LLM framework for narrative summarization that processes long-form text through a structured, sequential pipeline--without requiring fine-tuning. Our approach introduces two key innovations: (1) Dialogue-to-Description Transformation: A narrative-specific preprocessing method that standardizes character dialogue and descriptive text into a unified format, improving coherence. (2) Hierarchical Multi-LLM Summarization: A structured summarization pipeline that optimizes chunk processing and controls output length for accurate, high-quality summaries. Our method establishes a new state-of-the-art in narrative summarization, achieving up to a 30.0% improvement in BERTScore (F1) across books, movies, and TV scripts. These results demonstrate the effectiveness of multi-agent LLMs in handling long-form content, offering a scalable approach for structured summarization in diverse storytelling domains.
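The general idea of hierarchical summarization with length control can be sketched as follows. This is a generic illustration, not the NexusSum pipeline: the function names, the chunk sizes, and the `toy_summarize` stand-in (which simply keeps the first few words where a real system would call an LLM) are all assumptions.

```python
def chunk_words(words, chunk_size):
    """Split a list of words into consecutive chunks of at most chunk_size."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

def toy_summarize(text, max_words):
    """Placeholder for an LLM call: keep only the first max_words words."""
    return " ".join(text.split()[:max_words])

def hierarchical_summarize(text, chunk_size=50, per_chunk_words=10,
                           target_words=20, summarize=toy_summarize):
    """Summarize chunk by chunk, then summarize the summaries, until the
    text fits in a single chunk; a final pass enforces the target length."""
    if per_chunk_words * 2 > chunk_size:
        raise ValueError("per_chunk_words must be at most half of chunk_size")
    words = text.split()
    while len(words) > chunk_size:
        chunks = chunk_words(words, chunk_size)
        summaries = [summarize(" ".join(c), per_chunk_words) for c in chunks]
        new_words = " ".join(summaries).split()
        if len(new_words) >= len(words):
            break  # summarizer failed to shrink the text; avoid looping forever
        words = new_words
    return summarize(" ".join(words), target_words)

# A synthetic 500-word "narrative": w0 w1 ... w499.
long_text = " ".join(f"w{i}" for i in range(500))
summary = hierarchical_summarize(long_text)
```

With the defaults, 500 words shrink to 100 after the first level and to 20 after the second, so the output length is governed by the parameters rather than by the input length.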


Using different sources of ground truths and transfer learning to improve the generalization of photometric redshift estimation

Soriano, Jonathan, Saikrishnan, Srinath, Seenivasan, Vikram, Boscoe, Bernie, Singal, Jack, Do, Tuan

arXiv.org Artificial Intelligence

In this work, we explore methods to improve galaxy redshift predictions by combining different ground truths. Traditional machine learning models rely on training sets with known spectroscopic redshifts, which are precise but only represent a limited sample of galaxies. To make redshift models more generalizable to the broader galaxy population, we investigate transfer learning and directly combining ground truth redshifts derived from photometry and spectroscopy. We use the COSMOS2020 survey to create a dataset, TransferZ, which includes photometric redshift estimates derived from up to 35 imaging filters using template fitting. This dataset spans a wider range of galaxy types and colors compared to spectroscopic samples, though its redshift estimates are less accurate. We first train a base neural network on TransferZ and then refine it using transfer learning on a dataset of galaxies with more precise spectroscopic redshifts (GalaxiesML). In addition, we train a neural network on a combined dataset of TransferZ and GalaxiesML. Both methods reduce bias by $\sim$ 5x, RMS error by $\sim$ 1.5x, and catastrophic outlier rates by 1.3x on GalaxiesML, compared to a baseline trained only on TransferZ. However, we also find a reduction in performance for RMS and bias when evaluated on TransferZ data. Overall, our results demonstrate these approaches can meet cosmological requirements.
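The three quantities compared above (bias, RMS error, and catastrophic outlier rate) can be computed as in the sketch below. The abstract does not define them, so the definitions here are assumptions based on common conventions: residuals normalized by $1 + z_{\rm true}$, bias as their mean, RMS as their root mean square, and a catastrophic outlier as a prediction with $|z_{\rm pred} - z_{\rm true}| > 1$. The threshold is exposed as a parameter because conventions vary.

```python
import math

def redshift_metrics(z_pred, z_true, outlier_threshold=1.0):
    """Bias, RMS error, and catastrophic outlier rate for redshift estimates.

    Definitions are assumed conventions, not taken from the abstract:
      residual = (z_pred - z_true) / (1 + z_true)
      bias     = mean of the residuals
      rms      = root mean square of the residuals
      outlier  = |z_pred - z_true| > outlier_threshold
    """
    if len(z_pred) != len(z_true) or not z_true:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [(p - t) / (1.0 + t) for p, t in zip(z_pred, z_true)]
    n = len(residuals)
    bias = sum(residuals) / n
    rms = math.sqrt(sum(r * r for r in residuals) / n)
    n_outliers = sum(1 for p, t in zip(z_pred, z_true)
                     if abs(p - t) > outlier_threshold)
    return {"bias": bias, "rms": rms, "outlier_rate": n_outliers / n}

# Made-up values for illustration; the third prediction is a catastrophic outlier.
metrics = redshift_metrics(z_pred=[0.1, 1.0, 3.0, 3.0],
                           z_true=[0.0, 1.0, 1.0, 3.0])
```

Evaluating two models with the same function on the same held-out set is what makes a statement such as "reduces bias by a factor of 5" meaningful.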


GalaxiesML: a dataset of galaxy images, photometry, redshifts, and structural parameters for machine learning

Do, Tuan, Boscoe, Bernie, Jones, Evan, Li, Yun Qi, Alfaro, Kevin

arXiv.org Artificial Intelligence

We present a dataset built for machine learning applications consisting of galaxy photometry, images, spectroscopic redshifts, and structural properties. This dataset comprises 286,401 galaxy images and photometry from the Hyper-Suprime-Cam Survey PDR2 in five imaging filters ($g,r,i,z,y$) with spectroscopically confirmed redshifts as ground truth. Such a dataset is important for machine learning applications because it is uniform, consistent, and has minimal outliers but still contains a realistic range of signal-to-noise ratios. We make this dataset public to help spur development of machine learning methods for the next generation of surveys such as Euclid and LSST. The aim of GalaxiesML is to provide a robust dataset that can be used not only for astrophysics but also for machine learning, where image properties cannot be validated by the human eye and are instead governed by physical laws. We describe the challenges associated with putting together a dataset from publicly available archives, including outlier rejection, duplication, establishing ground truths, and sample selection. This is one of the largest public machine learning-ready training sets of its kind, with redshifts ranging from 0.01 to 4. The redshift distribution of this sample peaks at a redshift of 1.5 and falls off rapidly beyond a redshift of 2.5. We also include an example application of this dataset for redshift estimation, demonstrating that using images for redshift estimation produces more accurate results compared to using photometry alone. For example, the bias in the redshift estimate is a factor of 10 lower when using images for redshifts between 0.1 and 1.25 compared to photometry alone. Results from datasets such as this will help inform us on how to best make use of data from the next generation of galaxy surveys.


Using Galaxy Evolution as Source of Physics-Based Ground Truth for Generative Models

Li, Yun Qi, Do, Tuan, Jones, Evan, Boscoe, Bernie, Alfaro, Kevin, Nguyen, Zooey

arXiv.org Artificial Intelligence

Generative models producing images have enormous potential to advance discoveries across scientific fields and require metrics capable of quantifying the high dimensional output. We propose that astrophysics data, such as galaxy images, can test generative models with physics-motivated ground truths in addition to human judgment. For example, galaxies in the Universe form and change over billions of years, following physical laws and relationships that are both easy to characterize and difficult to encode in generative models. We build a conditional denoising diffusion probabilistic model (DDPM) and a conditional variational autoencoder (CVAE) and test their ability to generate realistic galaxies conditioned on their redshifts (galaxy ages). This is one of the first studies to probe these generative models using physically motivated metrics. We find that both models produce comparably realistic galaxies based on human evaluation, but our physics-based metrics are better able to discern the strengths and weaknesses of the generative models. Overall, the DDPM performs better than the CVAE on the majority of the physics-based metrics. Ultimately, if we can show that generative models can learn the physics of galaxy evolution, they have the potential to unlock new astrophysical discoveries.


Catching fire: AI helps scarce firefighters better predict blazes

#artificialintelligence

LOS ANGELES, July 22 (Thomson Reuters Foundation) - Last summer, as Will Harling captained a fire engine trying to control a wildfire that had burst out of northern California's Klamath National Forest, overrun a firebreak and raced towards his hometown, he got a frustrating email. It was a statistical analysis from Oregon State University forestry researcher Chris Dunn, predicting that the spot where firefighters had built the firebreak, on top of a ridge a few miles out of town, had only a 10% chance of stopping the blaze. "They had spent so many resources building that useless break," said Harling, who directs the Mid Klamath Watershed Council, and works as a wildland firefighter for the local Karuk Tribe. "The index showed it had no chance," he told the Thomson Reuters Foundation in a phone interview. The Suppression Difficulty Index (SDI) is one of a number of analytical tools Dunn and other firefighting technology experts are building to bring the latest in machine learning, big data and forecasting to the world of firefighting.


Some Data Scientist New Year Resolutions for 2017

#artificialintelligence

I've never been very big on New Year's resolutions. I've tried them in the past, and while they are nice to think about, they are always overly vague, difficult to accomplish in a year, trite, or just don't get done (or attempted). This year I decided to try something different instead of just not making resolutions at all. I set out some professional goals for myself as a Data Scientist. Open source software is only as good as its community and/or developer(s).